In Sections 12.1 - 12.3 we saw how supervised and unsupervised learners alike can be extended to perform nonlinear learning via the use of arbitrary linear combinations of nonlinear functions / feature transformations. However the general problem of engineering an appropriate nonlinearity for a given dataset has thus far remained elusive. In this Section we introduce the first of two major tools for handling this task: universal approximators. Universal approximators are families of simple nonlinear feature transformations whose members can be combined to create arbitrarily complex nonlinearities like any we would ever expect to find in a supervised or unsupervised learning dataset. Here we will also introduce the three standard types of universal approximators employed in practice today - kernels, neural networks, and trees - of which we will have much more to say in future Chapters.
Virtually every complicated object in the world around us can be decomposed into simpler elements. An automobile consists of components like windows, tires, an engine, and so on. A car engine can itself be broken down into a combination of yet simpler components - e.g., a set of pistons, an alternator, etc. Even the base alloys used to construct these engine components can be broken down into combinations of metals, and these metals into combinations of elements from the periodic table.
Complicated mathematical objects - i.e., mathematical functions - can be similarly broken down into combinations of simpler elements, i.e., into combinations of simpler mathematical functions. Stated more formally, any nonlinear mathematical function $h\left(\mathbf{x}\right)$ can be broken down into / approximated by a linear combination of $B$ 'simple' functions $f_1,\,f_2,\,...\,f_B$ in general as
\begin{equation} h\left(\mathbf{x}\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} + f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_B\left(\mathbf{x}\right)w_B \end{equation}where each of the simpler nonlinear functions on the right hand side can have internal parameters. This is actually a rather old mathematical fact (e.g., with roots in the 1700s) which we call universal approximation in the machine learning / deep learning world. These simple functions are called universal approximators.
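As a concrete (if hypothetical) numerical sketch of this fact, the Python cell below approximates a smooth nonlinear function by a linear combination of $B$ fixed sinusoidal units, with the weights $w_0,\ldots,w_B$ of the combination tuned by least squares. The particular target function and units used here are our own choices for illustration only.

```python
import numpy as np

# target nonlinear function to approximate (an arbitrary choice for illustration)
h = lambda x: np.exp(-x) * np.sin(3 * x)

# B 'simple' fixed functions f_1,...,f_B - here sinusoids of increasing frequency
B = 20
fs = [lambda x, b=b: np.sin(b * x) for b in range(1, B + 1)]

# sample the input space and build the feature matrix [1, f_1(x), ..., f_B(x)]
x = np.linspace(0, np.pi, 200)
F = np.column_stack([np.ones_like(x)] + [f(x) for f in fs])

# tune the weights w_0,...,w_B of the linear combination by least squares
w, *_ = np.linalg.lstsq(F, h(x), rcond=None)
approx = F @ w

# the combination now closely matches h over the sampled interval
print(np.max(np.abs(approx - h(x))))
```

Increasing $B$ (and re-tuning the weights) shrinks the remaining gap further, which is the essence of universal approximation.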
For example, below we show a continuous nonlinear function $h$ in black, and approximate this function using $B = 200$ universal approximators. As you move the slider from left to right you will see a red curve jump onto the screen - this is the combination of our $200$ universal approximators, shown first with a poor choice of parameter settings. As you continue moving the slider from left to right we illustrate how the shape of this combination improves (in terms of how well it approximates $h$) as we improve its parameters. Eventually - moving the slider all the way to the right, where we show an especially good setting of these parameters - the combination approximates $h$ quite well (and indeed if we kept refining its parameters it could represent $h$ perfectly).
Even functions with discrete jumps - like a step function - can be approximated infinitely closely using a combination of universal approximators. Below we show a similar animation to the one above, only here we approximate a step function with discontinuous jumps. Here we use $B=100$ universal approximators, and once again as you move the slider from left to right the parameters of the combination are improved so that the approximation looks more and more like the function $h$ itself.
What are these 'simple' universal approximators $f_1,\,f_2,\,...\,f_B$? How many do we need (how large does $B$ have to be) for a given nonlinear $h$? How do we set the parameters of this linear combination correctly (including possible internal parameters of the functions $f_1,\,f_2,\,...\,f_B$)? What does this have to do with determining a proper nonlinearity for a supervised/unsupervised dataset? We address each of these questions below.
There is an enormous variety of simple functions $f_1,\,f_2,\,...\,f_B$ that fall into the category of universal approximators. Typically these are organized into three families or catalogs of similar functions called kernels, neural networks, and trees respectively. Roughly speaking, using an exemplar from any one of these families we can perform the kind of universal approximation detailed above.
The family of kernel functions consists of groups of functions with no internal parameters, a popular example being polynomials. When dealing with just one input this sub-family of kernel functions looks like
\begin{equation} f_1(x) = x, ~~ f_2(x) = x^2,~~ f_3(x)=x^3,... \end{equation}and so forth with the $D^{th}$ element taking the form $f_D(x) = x^D$. A combination of the first $D$ units from this sub-family is often referred to as a degree $D$ polynomial. There are an infinite number of these functions - one for each positive whole number - and they are naturally ordered by their degree (the higher the degree, the further down the list that function is). Moreover each element has a fixed shape - e.g., the monomial $f_2(x) = x^2$ is always a parabola, and looks like a cup facing upwards. In the next Python cell we plot the first four elements (also called units) of the polynomial sub-family of kernel functions.
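As a self-contained sketch of how such a table of units can be generated (the helper name `poly_units` is our own, not part of the text's library), consider:

```python
import numpy as np

def poly_units(x, D):
    """Evaluate the first D polynomial units f_d(x) = x**d at the points in x."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** d for d in range(1, D + 1)], axis=0)

# each unit has a fixed shape - e.g., f_2 is always an upward-facing parabola
units = poly_units(np.linspace(-1, 1, 5), 4)
print(units[1])  # f_2(x) = x**2 evaluated at -1, -0.5, 0, 0.5, 1
```

Each row of `units` is one fixed, parameter-free element of the catalog; only the weights of their linear combination are tuned.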
In two inputs $x_1$ and $x_2$ polynomial units take the analogous form
\begin{equation} f_1(x_1,x_2) = x_1, ~~ f_2(x_1,x_2) = x_2^2, ~~ f_3(x_1,x_2) = x_1x_2^3, ~~ f_4(x_1,x_2) = x_1^4x_2^6,... \end{equation}with a general degree $D$ unit taking the form
\begin{equation} f_m(x_1,x_2) = x_1^px_2^q \end{equation}where $p$ and $q$ are nonnegative integers and $p + q \leq D$. A degree $D$ polynomial in this case is a linear combination of all such units. This sort of definition generalizes to defining polynomial units in general $N$ dimensional input as well. In the Python cell below we draw a sampling of these polynomial units.
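A brief (hypothetical) sketch of enumerating these two-input units - the helper name `poly_units_2d` is ours - makes the definition concrete, and also shows how quickly the number of units grows with the degree $D$:

```python
from itertools import product

def poly_units_2d(D):
    """List the exponent pairs (p, q) of every degree <= D polynomial
    unit x1**p * x2**q, excluding the constant unit (p = q = 0)."""
    return [(p, q) for p, q in product(range(D + 1), repeat=2)
            if 0 < p + q <= D]

units = poly_units_2d(2)
print(units)       # the five units x1, x2, x1^2, x1*x2, x2^2 (in some order)
print(len(units))  # a degree D polynomial in two inputs has (D+1)(D+2)/2 - 1 units
```

For example at degree $D = 3$ there are already $9$ units, so models built from two-input polynomials gain several units each time $D$ is incremented.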
Another classic sub-family of kernel universal approximators consists of sine waves of increasing frequency - that is, the set of sine waves whose frequency increases by, e.g., an integer factor like
$$f_1(x) = \text{sin}(x), ~~ f_2(x) = \text{sin}(2x), ~~ f_3(x) = \text{sin}(3x), ...$$where the $m^{th}$ element is given as $f_m(x) = \text{sin}(mx)$.
In the next Python cell we plot the table of values for the first four of these catalog functions using their equations.
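A minimal numpy sketch of such a table of values (without the plotting machinery of the text's library) might look like:

```python
import numpy as np

# table of values for the first four sine units f_m(x) = sin(m*x)
x = np.linspace(0, 2 * np.pi, 100)
table = np.stack([np.sin(m * x) for m in range(1, 5)], axis=0)

# each row is one fixed-shape unit; the m-th completes m full periods over [0, 2*pi]
print(table.shape)
```

Plotting each row of `table` against `x` reproduces the four fixed catalog shapes.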
As with the polynomials, notice how each of these catalog elements is fixed. They have no tunable parameters inside - the third element always looks like $f_3(x) = \text{sin}(3x)$, that is, it always takes on that shape. Also note that, as with polynomials, to generalize this catalog of functions to higher dimensional input we pass each coordinate through a single dimensional version of the function and take products of the results. So in the case of $N=2$ inputs the functions take the form
\begin{equation} f_1(x_1,x_2) = \text{sin}(x_1), ~~ f_2(x_1,x_2) = \text{sin}(2x_1)\text{sin}(5x_2), ~~ f_3(x_1,x_2) = \text{sin}(4x_1)\text{sin}(2x_2), ~~ f_4(x_1,x_2) = \text{sin}(7x_1)\text{sin}(x_2), ~~ ... \end{equation}These are listed in no particular order, and in general we can write a catalog element as $f_m(x_1,x_2) = \text{sin}(px_1)\text{sin}(qx_2)$ where $p$ and $q$ are any nonnegative integers.
Another family of universal approximators are neural networks. Broadly speaking neural networks consist of parameterized functions whose members - since they are parameterized - can take on a variety of different shapes (unlike kernel functions which each take on a single fixed form). A sub-family of neural networks typically consists of a set of parameterized functions of a single type. For example, a common single layer neural network basis consists of parameterized $\text{tanh}$ functions
\begin{equation} f_1(x) = \text{tanh}\left(w_{1,0} + w_{1,1}x\right), ~~ f_2(x) = \text{tanh}\left(w_{2,0} + w_{2,1}x\right), ~~ f_3(x) = \text{tanh}\left(w_{3,0} + w_{3,1}x\right), ~~ f_4(x) = \text{tanh}\left(w_{4,0} + w_{4,1}x\right), ... \end{equation}Notice that because there are parameters inside the $\text{tanh}$ the $b^{th}$ such function $f_b$ can take on a variety of shapes depending on how we set its internal parameters $w_{b,0}$ and $w_{b,1}$. We illustrate this in the next Python cell by randomly setting these two values and plotting the table of the associated function.
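A plotting-free sketch of this experiment - drawing the internal parameters at random and evaluating the resulting unit - can be written with plain numpy (the random seed and input range are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 200)

# four instances of the single-layer unit f_b(x) = tanh(w_b0 + w_b1 * x),
# each with its internal parameters w_b0, w_b1 set at random
for b in range(4):
    w0, w1 = rng.standard_normal(2)
    f_b = np.tanh(w0 + w1 * x)
    # unlike a kernel unit, the shape here depends on the parameter setting
    print(b, f_b.min().round(2), f_b.max().round(2))
```

Plotting each `f_b` against `x` shows shifted, stretched, and flipped versions of the basic $\text{tanh}$ 'S' shape.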
To handle higher dimensional input we simply take a linear combination of the input, passing the result through the nonlinear function. For example, an element $f_j$ for general $N$ dimensional input looks like
\begin{equation} f_j\left(\mathbf{x}\right) = \text{tanh}\left(w_{j,0} + w_{j,1}x_1 + \cdots + w_{j,\,N}x_N\right) \end{equation}Choosing another elementary function gives another sub-catalog of single-layer neural network functions. The rectified linear unit (or 'relu' for short) is another popular example, elements of which (for single dimensional input) look like
\begin{equation} f_1(x) = \text{max}\left(0,w_{1,0} + w_{1,1}x\right), ~~ f_2(x) = \text{max}\left(0,w_{2,0} + w_{2,1}x\right), ~~ f_3(x) = \text{max}\left(0,w_{3,0} + w_{3,1}x\right), ~~ f_4(x) = \text{max}\left(0,w_{4,0} + w_{4,1}x\right), ... \end{equation}Since these also have internal parameters each can once again take on a variety of shapes. Below we plot $4$ instances of such a function, where in each case its internal parameters have been set at random.
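The same kind of random-instance sketch applies to relu units; again the seed and input range below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 200)

# four random instances of the relu unit f_b(x) = max(0, w_b0 + w_b1 * x)
for b in range(4):
    w0, w1 = rng.standard_normal(2)
    f_b = np.maximum(0, w0 + w1 * x)
    # each instance is a hinge: zero on one side of its break, linear on the other
    print(b, round(float(f_b.min()), 2), round(float(f_b.max()), 2))
```

Each random draw moves the hinge's break point and slope, giving the variety of shapes described above.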
Other 'deeper' elements of the neural network family are constructed by recursively combining single-layer elements. For example, to create a two-layer neural network function we take a linear combination of single-layer elements, like the $\text{tanh}$ ones shown above, and pass the result through the same $\text{tanh}$ function - e.g., summing $10$ such elements and passing the result through $\text{tanh}$ gives a 'two-layer' function. With this we have created an even more flexible function, since each internal single-layer function $f_j$ also has internal parameters, which we illustrate by example below. Here we show $4$ random instances of the above neural network function, where in each instance we have randomly set all of its parameters. As can be seen below, these instances are far more flexible than the single-layer ones.
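This recursive construction can be sketched in a few lines of numpy; the helper name `two_layer_unit` and the choice of $J = 10$ inner units are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-5, 5, 200)

def two_layer_unit(x, rng, J=10):
    """A 'two-layer' element: a linear combination of J single-layer tanh
    units, passed through tanh again, with all parameters drawn at random."""
    w0, w1 = rng.standard_normal((2, J))            # inner unit parameters
    inner = np.tanh(w0[:, None] + w1[:, None] * x)  # J single-layer units
    v = rng.standard_normal(J + 1)                  # outer combination weights
    return np.tanh(v[0] + v[1:] @ inner)

# four random instances - far more varied in shape than any single-layer unit
instances = [two_layer_unit(x, rng) for _ in range(4)]
print([inst.shape for inst in instances])
```

Plotting each instance against `x` shows curves with multiple bends, reflecting the added flexibility of the recursive composition.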
We go into substantially further detail in discussing this family of universal approximators in Chapter 13.
Like neural networks, a single element from the family of tree-based universal approximators can take on a wide array of shapes. The simplest sort of tree basis consists of discrete step functions - or, as they are more commonly referred to, stumps - whose break lies along a single dimension of the feature space. A stump with one dimensional input $x$ can be written as
\begin{equation} f_1(x) = \begin{cases} x < V_1 \,\,\,\,\, a_1 \\ x \geq V_1 \,\,\,\,\, b_1 \end{cases} ~~~~~~~~ f_2(x) = \begin{cases} x < V_2 \,\,\,\,\, a_2 \\ x \geq V_2 \,\,\,\,\, b_2 \end{cases} ~~ \cdots \end{equation}where $V_{1}$ is the split point at which the stump changes value, and $a_{1}$ and $b_{1}$ are the values taken on the two sides of the split, respectively, which we refer to as the levels of the stump.
In the Python cell that follows we plot four instances of such a stump basis element, illustrating the wide variety of shapes these units can take on.
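As a self-contained sketch of a single stump unit (the helper name `stump` is ours), note that a split point $V$ and two levels fully determine each instance:

```python
import numpy as np

def stump(x, V, a, b):
    """A stump unit: takes the level a where x < V and the level b where x >= V."""
    return np.where(np.asarray(x) < V, a, b)

# one split point V and two levels a, b fully determine this instance
levels = stump([-2.0, 0.0, 1.0, 3.0], V=0.5, a=-1.0, b=2.0)
print(levels)
```

Varying $V$, $a$, and $b$ slides the break point and the two levels, which is where this family's variety of shapes comes from.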
In a perfect world our desired approximations $\approx$ can attain strict equality $=$, i.e., $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$ for every point. In such a perfect instance the dataset lies precisely on a line (or more generally, a hyperplane). The most perfect dataset we could possibly have for linear regression is then - ignoring computation - a line (or hyperplane) itself (as illustrated below in the animated gif).
In complete analogy to the linear case, here our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p, \mathbf{w}\right) = y_p$, i.e., points lying perfectly on the nonlinear curve (or in general the nonlinear surface) defined by $f$. The most perfect dataset we could possibly have for nonlinear regression is then - ignoring computation - a curve (or surface) itself (as illustrated below in the animated gif).
Examining our setup, here for linear two-classification our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$, i.e., points lying perfectly on the step function with linear boundary given by $\text{model}\left(\mathbf{x},\mathbf{w}\right) = 0$. The most perfect dataset we could possibly have for linear classification is then - ignoring computation - a perfect step function (with linear boundary) itself. This is illustrated below in the animated gif.
Examining our setup, here for nonlinear two-classification our perfect dataset would consist of points where the nonlinear decision boundary given by $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = 0$ perfectly separates the data. In other words, that our points lie perfectly on the step function with this nonlinear boundary. The most perfect dataset we could possibly have for nonlinear classification is then - ignoring computation - a perfect step function (with nonlinear boundary) itself. This is illustrated below in the animated gif.
3D CLASSIFICATION FIT GOES HERE
In this example we animate the predictions of models built from $B$ polynomial units, which generally take the form
\begin{equation} \text{predict}\left(\mathbf{x}, \mathbf{w} \right) = w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right), \end{equation}fit to a variety of simple datasets. Notice in each case how - because these features come from the same sub-family of related nonlinear functions - the corresponding predictions change fairly gradually as additional units are added to the model (in other words, we can fine-tune the amount of nonlinearity we want in our prediction).
First, in the next cell we animate the final trained predictions given by fitting $B = 1$ through $B = 4$ polynomial units to the noisy sinusoid dataset shown in the previous Subsection. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data in the left panel. In the right panel we simultaneously show the Least Squares cost function value of each trained model.
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,5)])
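The cell above relies on the text's companion library and dataset files. As a self-contained sketch of the same kind of experiment, we can fit $B = 1$ through $B = 4$ polynomial units to a synthetic noisy sinusoid (our own stand-in for noisy_sin_sample.csv) with plain numpy, and check that the Least Squares cost shrinks as units are added:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for the noisy sinusoid dataset
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

costs = []
for B in range(1, 5):
    # feature matrix [1, x, x^2, ..., x^B] of the first B polynomial units
    F = np.column_stack([x ** d for d in range(B + 1)])
    # fit the linear combination by least squares
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    # Least Squares cost of the trained B-unit model
    costs.append(np.mean((F @ w - y) ** 2))

# the cost shrinks as each additional polynomial unit is added
print([round(c, 4) for c in costs])
```

Because each model's units are a subset of the next model's, the cost can never increase as $B$ grows - exactly the behavior shown in the right panel of the animation above.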
Next, in the following Python cell we animate the final trained predictions given by fitting $B = 1$ through $B = 10$ polynomial units to a three-dimensional noisy sinusoid dataset. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data. We do not plot the corresponding Least Squares cost function value, but as with the example above it decreases as we add more polynomial units to the model.
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_units = [v for v in range(1,11)],view = [20,110],basis = 'poly')
And finally, an analogous example with two-class classification. As you move the slider from left to right additional polynomial units are added to the model. In this case each notch of the slider adds multiple units, as we are increasing the degree $D$ of the overall polynomial (which, for two inputs, means adding several individual polynomial units each time $D$ is increased).
# load in dataset http://math.arizona.edu/~dsl/
csvname = datapath + '2_eggs.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)
# run animator
demo.brows_single_fits(num_units = [v for v in range(0,4)], basis = 'poly',view = [30,-80])
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tanh',num_elements = [v for v in range(1,5)])
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,12)] ,view = [20,110],basis = 'net')
# use the DrawBases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()
# draw elements of the depth-one tree (stump) basis
demo.show_1d_tree(depth = 1)
demo = nonlib.stump_visualizer_2d.Visualizer()
csvname = datapath + 'noisy_sin_raised.csv'
demo.load_data(csvname)
demo.browse_stumps()
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tree',num_elements = [v for v in range(1,10)])
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,20)] ,view = [20,110],basis = 'tree')
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'sin_function.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,50)])
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,25)])
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,25)])
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='poly',num_elements = [v for v in range(1,10)],folds = 3)
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='tanh',num_elements = [v for v in range(1,10)],folds = 3)
